Predicting Wildfires¶
By: Eugenia Poon, Kayla Zhu
Motivation: The worsening California wildfires of recent years have destroyed ecosystems, magnified air pollution and health risks, and strained labor and resources.
Goal: predict the risk, presence, and spread of fire
Table of Contents¶
Load and Clean Data
Feature Engineering
Modeling
%matplotlib inline
import os
import os.path
import random
import warnings
import zipfile
import geopandas as gpd
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import xarray as xr
from IPython.display import HTML, Javascript, display
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from matplotlib import pyplot as plt
from plotly import express as px, graph_objects as go
from sklearn.metrics import (
balanced_accuracy_score,
f1_score,
mean_absolute_error,
mean_squared_error,
precision_recall_fscore_support,
)
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Dense, Input, Normalization
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from xgboost import XGBClassifier
from wildfireFunc import (
getDF,
getFireRange,
getInter,
getNC,
plotInter,
plotRes,
model,
plot_loss,
score,
split,
)
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['figure.dpi'] = 400
plt.rcParams['font.size'] = 5
plt.rcParams['figure.titlesize'] = 10
plt.rcParams['axes.linewidth'] = 0.1
plt.rcParams['patch.linewidth'] = 0
display(HTML("<style>.container { width:100% !important; }</style>"))
random.seed(100)
tf.random.set_seed(100)
np.random.seed(100)
Load and Clean Data¶
Shapefiles¶
- Northern Sierra Nevada only: Tahoe, Plumas, and Lassen National Forests
- How to Query
| | Sierra Nevada | Fire Perimeter |
|---|---|---|
| SOURCE | Sierra Nevada Conservancy Boundary | CAL Fire |
| OPEN | View Data Source | View Data Source |
| OPEN | | California Fire Perimeters (1950+) |
| OPEN | Query | Query |
| Where: | 1=1 | YEAR_=2021 |
| Geometry Type: | Polygon | Polygon |
| Spatial Relationship: | Intersects | Intersects |
| Units: | Meters | Meters |
| Out fields: | * | * |
| Return Geometry: | True | True |
| Format: | GeoJSON | GeoJSON |
| CLICK | Query (GET) | Query (GET) |
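The long hardcoded query URLs in the next cells can be assembled more readably with `urllib`, passing only the parameters the table above sets explicitly (the service fills in defaults for the rest). A minimal sketch; the endpoint is copied from the cell below, and the `query_url` helper is illustrative, not part of this notebook's pipeline:

```python
from urllib.parse import urlencode

# Same Sierra Nevada Conservancy boundary layer as the hardcoded URL below.
BASE = ('https://services1.arcgis.com/sTaVXkn06Nqew9yU/arcgis/rest/services/'
        'Sierra_Nevada_Conservancy_Boundary/FeatureServer/0/query')

def query_url(where='1=1', fmt='pgeojson'):
    """Build an ArcGIS REST query URL from just the parameters we care about."""
    params = {
        'where': where,
        'geometryType': 'esriGeometryPolygon',
        'spatialRel': 'esriSpatialRelIntersects',
        'units': 'esriSRUnit_Meter',
        'outFields': '*',
        'returnGeometry': 'true',
        'f': fmt,
    }
    return BASE + '?' + urlencode(params)

print(query_url(where='YEAR_=2021'))
```

The resulting URL can be passed straight to `gpd.read_file`, exactly like the hardcoded string below.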
# sierra nevada
url = 'https://services1.arcgis.com/sTaVXkn06Nqew9yU/arcgis/rest/services/Sierra_Nevada_Conservancy_Boundary/FeatureServer/0/query?where=1%3D1&objectIds=&time=&geometry=&geometryType=esriGeometryPolygon&inSR=&spatialRel=esriSpatialRelIntersects&resultType=none&distance=&units=esriSRUnit_Meter&relationParam=&returnGeodetic=false&outFields=*&returnGeometry=true&returnCentroid=false&featureEncoding=esriDefault&multipatchOption=xyFootprint&maxAllowableOffset=&geometryPrecision=&outSR=&defaultSR=&datumTransformation=&applyVCSProjection=false&returnIdsOnly=false&returnUniqueIdsOnly=false&returnCountOnly=false&returnExtentOnly=false&returnQueryGeometry=false&returnDistinctValues=false&cacheHint=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&having=&resultOffset=&resultRecordCount=&returnZ=false&returnM=false&returnExceededLimitFeatures=true&quantizationParameters=&sqlFormat=none&f=pgeojson&token='
snBoundaries = gpd.read_file(url).to_crs(epsg=4326)
snBounds = snBoundaries.geometry.bounds
bounds = [-122, 39.35, snBounds.iloc[0].maxx, 41]  # keep the northern portion
boundaries = gpd.clip(snBoundaries, bounds)
bounds = boundaries.geometry.bounds
display(snBounds)
# fire perimeter
url1 = 'https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/ArcGIS/rest/services/California_Fire_Perimeters/FeatureServer/2/query?where=YEAR_%3D'
url2 = '&objectIds=&time=&geometry=&geometryType=esriGeometryPolygon&inSR=&spatialRel=esriSpatialRelIntersects&resultType=none&distance=&units=esriSRUnit_Meter&relationParam=&returnGeodetic=false&outFields=*&returnGeometry=true&returnCentroid=false&featureEncoding=esriDefault&multipatchOption=xyFootprint&maxAllowableOffset=&geometryPrecision=&outSR=&defaultSR=&datumTransformation=&applyVCSProjection=false&returnIdsOnly=false&returnUniqueIdsOnly=false&returnCountOnly=false&returnExtentOnly=false&returnQueryGeometry=false&returnDistinctValues=false&cacheHint=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&having=&resultOffset=&resultRecordCount=&returnZ=false&returnM=false&returnExceededLimitFeatures=true&quantizationParameters=&sqlFormat=none&f=pgeojson&token='
years = ['2020', '2021']
firePer = pd.concat([gpd.read_file(url1 + y + url2) for y in years])
dateCol = ['ALARM_DATE', 'CONT_DATE']
firePer = firePer[firePer.STATE == 'CA'].dropna(subset=dateCol)
firePer.geometry = firePer.geometry.make_valid() # 2021 has invalid geometry
firePer = gpd.clip(firePer, boundaries) # fires in area
firePer[dateCol] = firePer[dateCol].apply(pd.to_datetime, unit='ms')
firePer = firePer.sort_values('ALARM_DATE').reset_index(drop=True)
print(f'{len(firePer)} fires')
display(firePer.sort_values('Shape__Area')[
['FIRE_NAME', 'Shape__Area']].tail())
| | minx | miny | maxx | maxy |
|---|---|---|---|---|
| 0 | -123.039382 | 35.064862 | -117.735384 | 41.994919 |
64 fires
| | FIRE_NAME | Shape__Area |
|---|---|---|
| 26 | LOYALTON | 3.198601e+08 |
| 41 | W-5 COLD SPRINGS | 6.049077e+08 |
| 58 | SUGAR | 7.236506e+08 |
| 34 | NORTH COMPLEX | 2.183903e+09 |
| 59 | DIXIE | 6.692866e+09 |
fig, ax = plt.subplots(nrows=1, ncols=1, dpi=600)
snBoundaries.plot(ax=ax, color='grey', alpha=0.1, linewidth=0)
boundaries.plot(ax=ax, color='grey', alpha=0.3, linewidth=0)
firePer.plot('YEAR_', legend=True, alpha=0.6, linewidth=0, ax=ax)
plt.tick_params(axis='both', labelsize=5)
ax.set(xlabel='longitude', ylabel='latitude')
plt.show()
Geospatial Data: GRIDMET¶
- GRIDMET (Download NetCDF Data)
- Clip 2020 and 2021 GRIDMET data to boundaries within the Tahoe, Plumas, and Lassen National Forest area in the Sierra Nevada
Description¶
- What starts a fire? High winds, high temperatures, and low humidity
- Years: 2020-2021
- Resolution: 4638.3 meters = 4.6383 km = 1/24 degrees
- Temporal Resolution: daily
- Key inputs to the NFDRS model are fuels, weather, topography, and risks.
| variable | var | units | description |
|---|---|---|---|
| relative_humidity | rmin | % | minimum relative humidity |
| air_temperature | tmmx | K | maximum temperature |
| wind_speed | vs | m/s | wind velocity at 10m |
| burning_index_g | bi | unitless | NFDRS burning index (fire danger) |
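`getNC` (imported from `wildfireFunc`, not shown here) downloads the NetCDF data and clips it to the boundary box. The spatial clip step can be sketched with xarray's label-based selection; the synthetic grid and `clip_to_bounds` helper below are illustrative assumptions, not `getNC`'s actual code:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one GRIDMET variable on a 1/24-degree grid.
lon = np.arange(-123.0, -119.0, 1 / 24)
lat = np.arange(42.0, 39.0, -1 / 24)  # GRIDMET latitudes descend
ds = xr.Dataset(
    {'air_temperature': (('lat', 'lon'),
                         np.random.rand(lat.size, lon.size))},
    coords={'lat': lat, 'lon': lon})

def clip_to_bounds(ds, bounds):
    """Clip a dataset to [minx, miny, maxx, maxy] in lon/lat degrees."""
    minx, miny, maxx, maxy = bounds
    return ds.sel(lon=slice(minx, maxx),
                  lat=slice(maxy, miny))  # lat coord is descending

sub = clip_to_bounds(ds, [-122, 39.35, -119.9, 41])
print(dict(sub.sizes))
```

Note the reversed `slice(maxy, miny)`: label-based slicing follows the coordinate's own (descending) order.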
path = 'data/gm.nc'
if not os.path.isfile(path + '.zip'):  # ~3 minutes
getNC(path, bounds, boundaries, years)
!rm data/gm.nc
ds = xr.open_dataset(zipfile.ZipFile(path+'.zip').open('gm.nc'))
print(ds.dims)
Frozen({'lon': 52, 'lat': 44, 'day': 731})
Interpolation¶
- Used for testing: a higher factor --> less data
- To get a lower resolution, choose a higher factor
- Lowering the resolution produces misleading performance scores because each pixel generalizes over a larger area
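`getInter` (from `wildfireFunc`) is not shown; one plausible reading of "higher factor --> lower resolution" is keeping every `factor`-th grid point, sketched below on a synthetic grid. Whether `getInter` subsamples like this or interpolates onto a coarser grid is an assumption:

```python
import numpy as np
import xarray as xr

def get_inter(ds, factor):
    """Reduce resolution by keeping every `factor`-th lon/lat point.
    factor=1 leaves the grid unchanged."""
    return ds.isel(lon=slice(None, None, factor),
                   lat=slice(None, None, factor))

# Demo on a synthetic 1/24-degree grid.
lon = np.arange(-122.0, -120.0, 1 / 24)
lat = np.arange(41.0, 39.3, -1 / 24)
ds = xr.Dataset({'wind_speed': (('lat', 'lon'),
                                np.zeros((lat.size, lon.size)))},
                coords={'lat': lat, 'lon': lon})

dd = get_inter(ds, factor=2)
# Grid spacing doubles, from 1/24 to 1/12 degrees.
print(float(np.diff(dd.lon.values)[0]))
```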
factor = 1
dd = getInter(ds, factor)
print(np.unique(abs(np.diff(dd.lon.to_numpy()))),
np.unique(abs(np.diff(dd.lat.to_numpy()))))
print()
res = abs(np.diff(dd.lon.to_numpy()))[0]
display(dd)
[0.04166667 0.04166667] [0.04166667 0.04166667]
<xarray.Dataset>
Dimensions: (lon: 52, lat: 44, day: 731)
Coordinates:
* lon (lon) float64 -122.1 -122.0 -122.0 ... -120.0 -119.9
* lat (lat) float64 41.07 41.03 40.98 ... 39.36 39.32 39.28
* day (day) datetime64[ns] 2020-01-01 2020-01-02 ... 2021-12-31
Data variables:
burning_index_g (day, lat, lon) float32 ...
relative_humidity (day, lat, lon) float32 ...
air_temperature (day, lat, lon) float32 ...
wind_speed (day, lat, lon) float32 ...
Attributes:
units: Unitless
description: BI-G
long_name: bi
standard_name: bi
dimensions: lon lat time
grid_mapping: crs
coordinate_system: WGS84,EPSG:4326

plotInter(firePer, ds)
Feature Engineering¶
Create Fire Column Using Fire Perimeter¶
- Indicate whether a coordinate had a fire on a date, based on CAL Fire's data
- Method:
- coordinates (lon, lat) are pixel centers; create bigger polygons (squares) to cover the area around each coordinate
- check if a date falls within a fire's date range
- True: fire on that date
- False: no fire
- check if the polygon intersected a fire on that date
- True: fire on that date at that coordinate
- False: no fire
- How to get intersection
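`getDF` (from `wildfireFunc`) implements the steps above; the core per-pixel intersection test might look like the sketch below. The square half-width of `res/2` and the alarm-to-containment date filter are assumptions about `getDF`'s internals:

```python
import pandas as pd
from shapely.geometry import Point, box

def had_fire(lon, lat, day, res, fire_per):
    """True if the square pixel centered at (lon, lat) intersects any fire
    perimeter active on `day` (between alarm and containment dates)."""
    pixel = box(lon - res / 2, lat - res / 2,
                lon + res / 2, lat + res / 2)
    active = fire_per[(fire_per.ALARM_DATE <= day) &
                      (fire_per.CONT_DATE >= day)]
    return active['geometry'].apply(pixel.intersects).any()

# Tiny synthetic "perimeter": a circle of radius 0.1 degrees.
fires = pd.DataFrame({
    'ALARM_DATE': [pd.Timestamp('2021-07-14')],
    'CONT_DATE': [pd.Timestamp('2021-08-01')],
    'geometry': [Point(-120.5, 40.0).buffer(0.1)],
})

print(had_fire(-120.5, 40.0, pd.Timestamp('2021-07-20'), 1/24, fires))  # inside, active
print(had_fire(-120.5, 40.0, pd.Timestamp('2021-06-01'), 1/24, fires))  # before alarm
```

`getDF` presumably vectorizes this over every (coordinate, day) pair rather than looping.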
path = 'data/gm.pkl'
if not os.path.isfile(path + '.zip'):
getDF(path, res, dd, firePer)
!rm data/gm.pkl
df = pd.read_pickle(zipfile.ZipFile(path+'.zip').open('gm.pkl'))
df['coor'] = df.groupby(['lat', 'lon']).ngroup()
display(df.info(verbose=False))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1319455 entries, 0 to 1319454
Columns: 10 entries, lon to coor
dtypes: datetime64[ns](1), float32(4), float64(2), int64(3)
memory usage: 80.5 MB
None
# get data for 3 days before
fire = df.copy()
cols = ['burning_index_g', 'relative_humidity',
'air_temperature', 'wind_speed']
shifted = []
for i in range(1, 4):
    temp = fire.groupby('coor')[cols].shift(i)
    temp = temp.add_suffix(str(i))  # e.g. burning_index_g1
    shifted.append(temp)
shifted = pd.concat(shifted, axis=1).bfill()  # backfill the first shifted days
fire = pd.concat([fire, shifted], axis=1) # keep current
# fire = pd.concat([fire.drop(columns=cols), shifted], axis=1) # remove current
fire = fire.reindex(sorted(fire.columns), axis=1)
Get Nearby Extreme Data¶
For every coordinate on day x:
- get the surrounding coordinates (the pixels around a pixel)
- get extreme values from the two days before (days x-1 and x-2) across the coordinate and its surroundings:
- max air temperature
- max burning index
- max wind speed
- min relative humidity
ll = df[['lat', 'lon']].drop_duplicates()
def getSurrounding(lat, lon, ll=ll, fire=fire):
    """Return the neighbors' rows, relabeled with the center coordinate."""
    n, s = lat + res, lat - res
    e, w = lon + res, lon - res
    ll = ll[(ll.lat <= n) & (ll.lat >= s) & (ll.lon <= e) & (ll.lon >= w)]
    temp = fire[fire.lat.isin(ll.lat) & fire.lon.isin(ll.lon)]
    # remove the center pixel itself, keeping only its neighbors
    temp = temp[~((temp.lat == lat) & (temp.lon == lon))].copy()
    # relabel neighbors with the center coordinate for the later groupby
    temp['lat'] = lat
    temp['lon'] = lon
    return temp
# get surrounding coordinates
f = fire.loc[:, ~fire.columns.str.endswith('3')]
ex = pd.concat([getSurrounding(lat=g[0][0], lon=g[0][1])
for g in f.groupby(['lat', 'lon'])])
# get extreme data for 2 days before (surrounding)
extreme = (ex.groupby(['lat', 'lon', 'day'])
.agg(at1=('air_temperature1', max),
at2=('air_temperature2', max),
bi1=('burning_index_g1', max),
bi2=('burning_index_g2', max),
rh1=('relative_humidity1', min),
rh2=('relative_humidity2', min),
ws1=('wind_speed1', max),
ws2=('wind_speed2', max),))
extreme = (f.merge(extreme.reset_index(), on=['lat', 'lon', 'day'])
.drop(columns=['fire', 'lon', 'lat', 'relative_humidity',
'air_temperature', 'wind_speed']))
Categorize Burn Index¶
Burning index thresholds, pg 25 or 35 (Table)
| BI-1978 | Narrative Comments |
|---|---|
| 0-30 | Most prescribed burns are conducted in this range. |
| 30-40 | Generally represent the limit of control for direct attack methods. |
| 40-60 | Machine methods usually necessary or indirect attack should be used |
| 60-80 | The prospects for direct control by any means are poor above this intensity. |
| 80-90 | The heat load on people within 30 feet of the fire is dangerous. |
| 90-110+ | Above this intensity, spotting, fire whirls, and crowning should be expected. |
# merge the 30-40 and 40-60 ranges into one class, giving 5 classes total
bins = [-1, 30, 60, 80, 90, max(extreme['burning_index_g'])+1]
labels = np.arange(len(bins)-1)
danger = to_categorical(pd.cut(extreme['burning_index_g'],
bins=bins, labels=labels))
extreme = pd.concat([extreme, pd.DataFrame(danger)], axis=1)
sns.countplot(x=np.argmax(danger, axis=1), palette=sns.color_palette('pastel'))
Sample¶
S = 100
fireS = fire.sample(n=100000, random_state=S)
extremeS = extreme.sample(n=100000, random_state=S)
aa = df.groupby('day').mean(numeric_only=True).drop(columns='fire')
bb = pd.DataFrame(StandardScaler().fit_transform(aa),
columns=aa.columns, index=aa.index)
bb = bb[cols].rolling(15).mean().reset_index()
px.line(bb, x=bb.day, y=cols).update_layout(margin=dict(t=0, b=0, r=0, l=0))
col = ['burning_index_g', 'relative_humidity', 'air_temperature', 'wind_speed']
n = ['fire_length', 'start_month']
fig1 = px.histogram(fireS, x=col, histnorm='percent')
fig2 = px.histogram(x=[len(x) for x in getFireRange(firePer)],
histnorm='percent')
fig2.update_traces(dict(name=n[0], offsetgroup=n[0], legendgroup=n[0],
marker_color=px.colors.qualitative.Plotly[5],
showlegend=True))
fig3 = px.histogram(x=firePer.ALARM_DATE.dt.month, histnorm='percent')
fig3.update_traces(dict(name=n[1], offsetgroup=n[1], legendgroup=n[1],
marker_color=px.colors.qualitative.Plotly[6],
showlegend=True))
fig = go.Figure(data=fig1.data+fig2.data+fig3.data)
fig.update_layout(margin=dict(t=15, b=0, r=0, l=0),
barmode='overlay', bargap=0.1)
fig.update_traces(visible='legendonly', opacity=0.7)
fig.data[0].visible = True
fig.show()
Modeling¶
Model 1: Classification Using Fire Perimeter¶
| Idea | predict if each coordinate had a fire on a day |
| Data | dependent (target): fire indicator built from the dates and coordinates given by the fire perimeters |
| independent (features): weather of the previous 3 days and others | |
| Metric | F1 considers precision and recall but not true negatives (no fire) |
| maximize true positives (correctly predicted fires), because misclassifying an actual fire as no fire is dangerous | |
| Issue | perimeters include all coordinates that burned at any point over one or more days |
| for fires that span multiple days, some coordinates might not burn until days after the start of the fire | |
| so the model is trained to indicate fire even when weather conditions are normal | |
target = ['fire']
temp = fireS.drop(columns=['burning_index_g', 'relative_humidity',
'air_temperature', 'wind_speed'])
X1, X_test1, y1, y_test1, X_train1, X_val1, y_train1, y_val1 = split(
temp, target)
print(f'{np.mean(y1) * 100:.2f}% yes fire\n')
mm1 = model(X_train1, y_train1, X_val1, y_val1, method='c')
4.17% yes fire

Logistic Regression - Decision Tree - LDA - LightGBM - XGBoost -
| | F1 | TPR | FPR | Precision | logloss |
|---|---|---|---|---|---|
| XGBoost | 0.677863 | 0.595174 | 0.006955 | 0.787234 | 0.845023 |
| LightGBM | 0.616260 | 0.508043 | 0.006086 | 0.783058 | 0.945145 |
| Decision Tree | 0.523975 | 0.505362 | 0.018315 | 0.544012 | 1.371661 |
| Logistic Regression | 0.169022 | 0.761394 | 0.313377 | 0.095063 | 11.183545 |
| LDA | 0.015584 | 0.008043 | 0.001043 | 0.250000 | 1.517838 |
Refining Model: XGBoost¶
The model is not expected to perform much better because of the previously mentioned issues.
Class Imbalance
- majority class (0/no fire) and minority class (1/yes fire)
- the imbalance is severe enough that training sees too little 'yes fire' information
- accuracy is meaningless: if the model classified all no's correctly and all yes's incorrectly, accuracy would equal the proportion of no's (~0.96)
- Possible solutions:
- adjust class weights to assign higher weights to the minority class
- over- and under-sampling with SMOTE to synthesize new samples for the minority class
For the training data:
- oversample the minority class with SMOTE
- undersample the majority class
pipeline = Pipeline(steps=[('over', SMOTE(sampling_strategy=0.1)),
('under', RandomUnderSampler(sampling_strategy=0.5))
])
X_train1_, y_train1_ = pipeline.fit_resample(X_train1, y_train1)
xgb1 = XGBClassifier(eval_metric='aucpr', random_state=100)
xgb1.fit(X_train1_, y_train1_)
_, _, _, _, _ = score(y_val1, xgb1.predict(X_val1), method='c', p=True)
f1: 0.5789, TPR: 0.8901, FPR: 0.0512, Precision: 0.4289
Model 2: Classification Using Burn Index Classes¶
| Idea | predict the risk of fire based on prior nearby conditions |
| Data | dependent (target): categorized burn index |
| independent (features): weather from 2 days before and the worst weather conditions of nearby pixels | |
| Metric | F1 |
| Issue | the categorization may not be descriptive enough, but can be useful for identifying next-day dangers, assuming fires are environmentally caused or continuous |
target = list(labels)
temp = extremeS.drop(columns='burning_index_g')
X, X_test, yC, y_testC, X_train, X_val, y_trainC, y_valC = split(temp, target)
y_trainC = np.argmax(y_trainC, axis=1)
pipeline = Pipeline(steps=[('over', SMOTE()),
('under', RandomUnderSampler())])
X_train_, y_trainC_ = pipeline.fit_resample(X_train, y_trainC)
y_trainC = pd.DataFrame(to_categorical(y_trainC), index=X_train.index)
y_trainC_ = pd.DataFrame(to_categorical(y_trainC_), index=X_train_.index)
def heat(true, pred, ax, title):
print(title, end=': ')
print(f"F1={f1_score(true, pred, average='weighted'):.3f}", end=' ')
print(f'Acc={balanced_accuracy_score(true, pred):.3f}')
pre, rec, f1b, sup = precision_recall_fscore_support(
true, pred, labels=(labels))
comp = pd.DataFrame([pre, rec, f1b, sup], index=['pre', 'rec', 'f1',
'sup']).add_prefix('class')
display(comp)
sns.heatmap(pd.crosstab(true, pred,
rownames=['Actual'], colnames=['Predicted'],
margins=True, normalize='index'
), cmap='Blues', annot=True, ax=ax)
ax.set(title=title)
return comp
fig, axs = plt.subplots(ncols=2, figsize=(15, 7))
# original
xgbC = XGBClassifier(eval_metric='aucpr', random_state=100)
xgbC.fit(X_train, y_trainC)
comp = heat(np.argmax(y_valC, axis=1), np.argmax(xgbC.predict(X_val), axis=1),
axs[0], 'original')
# smote
xgbC = XGBClassifier(eval_metric='aucpr', random_state=100)
xgbC.fit(X_train_, y_trainC_)
comp = heat(np.argmax(y_valC, axis=1), np.argmax(xgbC.predict(X_val), axis=1),
axs[1], 'smote')
plt.show()
original: F1=0.770 Acc=0.587
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.755582 | 0.805201 | 0.765748 | 0.771739 | 0.908163 |
| rec | 0.885100 | 0.770072 | 0.703981 | 0.154013 | 0.421801 |
| f1 | 0.815229 | 0.787245 | 0.733567 | 0.256781 | 0.576052 |
| sup | 6423.000000 | 7037.000000 | 3868.000000 | 461.000000 | 211.000000 |
smote: F1=0.750 Acc=0.629
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.710569 | 0.813910 | 0.776124 | 0.519298 | 0.603960 |
| rec | 0.884478 | 0.743357 | 0.620217 | 0.321041 | 0.578199 |
| f1 | 0.788043 | 0.777035 | 0.689467 | 0.396783 | 0.590799 |
| sup | 6423.000000 | 7037.000000 | 3868.000000 | 461.000000 | 211.000000 |
Model 3: Regression Using Burn Index¶
| Idea | predict the spread of fire based on prior nearby conditions |
| Data | dependent (target): burn index |
| independent (features): weather from 2 days before and the worst weather conditions of nearby pixels | |
| Metric | RMSE |
| Issue | difficult to correctly predict zeros |
target = ['burning_index_g']
temp = extremeS.drop(columns=labels)
X, X_test, yR, y_testR, X_train, X_val, y_trainR, y_valR = split(temp, target)
mmR = model(X_train, y_trainR, X_val, y_valR, method='r')
plotRes(y_valR, mmR['XGBoost'][0].predict(X_val))
Linear Regression - Decision Tree - LightGBM - XGBoost -
| | rmse | mae |
|---|---|---|
| XGBoost | 13.303519 | 9.051157 |
| LightGBM | 13.791284 | 9.456384 |
| Linear Regression | 15.969732 | 10.757834 |
| Decision Tree | 17.770387 | 10.049889 |
Model 4: Neural Network¶
| Idea | predict the spread and risk of fire based on prior nearby conditions |
| Data | dependent (targets): burn index and categorized burn index |
| independent (features): weather from 2 days before and the worst weather conditions of nearby pixels | |
| Metric | RMSE, F1 |
| Issue | same as models 2 and 3 |
def modelNN(X_train, y_trainC):
# normalize data
normalizer = Normalization(axis=-1)
normalizer.adapt(np.array(X_train))
# input layer
visible = Input(shape=(X_train.shape[1],))
# hidden layers
hidden1 = Dense(512, activation='relu',
kernel_initializer='he_normal')(normalizer(visible))
hidden2 = Dense(512, activation='relu',
kernel_initializer='he_normal')(hidden1)
# output layer
outR = Dense(1, activation='linear')(hidden2) # reg
outC = Dense(y_trainC.shape[1], activation='softmax')(hidden2) # class
model = Model(inputs=visible, outputs=[outR, outC])
model.compile(loss=['mse', 'categorical_crossentropy'], optimizer='adam',
metrics=['RootMeanSquaredError', 'F1Score'])
return model
%%time
nn4 = modelNN(X_train, y_trainC)
history = nn4.fit(X_train, [y_trainR, y_trainC], validation_split=0.3,
epochs=50, batch_size=64, verbose=0)
y_predR, y_predC = nn4.predict(X_val, verbose=0)
y_predC = np.argmax(y_predC, axis=1)
y_valC = np.argmax(y_valC, axis=1)
plot_loss(history)
CPU times: user 1min 22s, sys: 25.3 s, total: 1min 47s
Wall time: 50.4 s
print('Regression: ', end=' ')
print(f'MAE={mean_absolute_error(y_valR, y_predR):.3f}', end=' ')
print(f'RMSE={mean_squared_error(y_valR, y_predR, squared=False):.3f}')
print('Classification:', end=' ')
plotRes(y_valR, y_predR)
fig, axs = plt.subplots()
compNN = heat(y_valC, y_predC, axs, 'Classification')
Regression: MAE=8.419 RMSE=12.219
Classification: F1=0.787 Acc=0.604
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.878116 | 0.756366 | 0.744739 | 0.405405 | 0.784946 |
| rec | 0.844621 | 0.831604 | 0.704498 | 0.292842 | 0.345972 |
| f1 | 0.861043 | 0.792203 | 0.724060 | 0.340050 | 0.480263 |
| sup | 6423.000000 | 7037.000000 | 3868.000000 | 461.000000 | 211.000000 |
display(comp, compNN)
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.710569 | 0.813910 | 0.776124 | 0.519298 | 0.603960 |
| rec | 0.884478 | 0.743357 | 0.620217 | 0.321041 | 0.578199 |
| f1 | 0.788043 | 0.777035 | 0.689467 | 0.396783 | 0.590799 |
| sup | 6423.000000 | 7037.000000 | 3868.000000 | 461.000000 | 211.000000 |
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.878116 | 0.756366 | 0.744739 | 0.405405 | 0.784946 |
| rec | 0.844621 | 0.831604 | 0.704498 | 0.292842 | 0.345972 |
| f1 | 0.861043 | 0.792203 | 0.724060 | 0.340050 | 0.480263 |
| sup | 6423.000000 | 7037.000000 | 3868.000000 | 461.000000 | 211.000000 |
comp.loc['f1', 'class2':].values, compNN.loc['f1', 'class2':].values
(array([0.68946688, 0.39678284, 0.59079903]), array([0.72406005, 0.34005038, 0.48026316]))
def perform_bootstrap(X_test, y_testR, y_testC, models: dict, samp=500):
    """Resample the test set with replacement and record each model's scores."""
    np.random.seed(100)
    for _ in range(samp):
        bs_index = np.random.choice(
            X_test.index, len(X_test.index), replace=True)
        bs_data = X_test.loc[bs_index]
        bs_testR = y_testR.loc[bs_index]
        bs_testC = np.argmax(y_testC.loc[bs_index], axis=1)
        for name, model in models.items():
            if model[0][1] == 'c':  # classification
                try:  # keras models accept a verbose kwarg
                    bs_pred = model[0][0].predict(bs_data, verbose=0)
                    bs_pred = np.argmax(bs_pred[1], axis=1)
                except TypeError:  # sklearn-style models do not
                    bs_pred = model[0][0].predict(bs_data)
                    bs_pred = np.argmax(bs_pred, axis=1)
                acc = balanced_accuracy_score(bs_testC, bs_pred)
                f1 = f1_score(bs_testC, bs_pred, average='weighted')
                model[1].append([acc, f1])
            elif model[0][1] == 'r':  # regression
                try:
                    bs_pred = model[0][0].predict(bs_data, verbose=0)[0]
                except TypeError:
                    bs_pred = model[0][0].predict(bs_data)
                model[1].append([*score(bs_testR, bs_pred, method='r')])
    return models
%%time
m = {
'XGBClassifier': [[xgbC, 'c'], []], # 2
'XGBRegressor': [[mmR['XGBoost'][0], 'r'], []], # 3
'Neural Network Class': [[nn4, 'c'], []], # 4C
'Neural Network Reg': [[nn4, 'r'], []], # 4R
}
bs = perform_bootstrap(X_test, y_testR, y_testC, models=m, samp=500)
CPU times: user 22min 48s, sys: 4min 19s, total: 27min 8s
Wall time: 15min 10s
# Bootstrap: Classification
fig, ax = plt.subplots(nrows=1, ncols=2)
for i in [0, 2]:
sns.histplot(np.array(list(bs.values())[i][1])[:, 0], edgecolor='none',
color=sns.color_palette()[i], alpha=0.4, ax=ax[0])
sns.histplot(np.array(list(bs.values())[i][1])[:, 1], edgecolor='none',
color=sns.color_palette()[i], alpha=0.4, ax=ax[1])
fig.suptitle('Bootstrap: Classification')
ax[0].set_title('Accuracy')
ax[1].set_title('F1')
plt.legend(title='Model', labels=list(bs.keys())[0:3:2],
bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()
# Bootstrap: Regression
fig, ax = plt.subplots(nrows=1, ncols=2)
for i in [1, 3]:
sns.histplot(np.array(list(bs.values())[i][1])[:, 0], edgecolor='none',
color=sns.color_palette()[i], alpha=0.4, ax=ax[0])
sns.histplot(np.array(list(bs.values())[i][1])[:, 1], edgecolor='none',
color=sns.color_palette()[i], alpha=0.4, ax=ax[1])
fig.suptitle('Bootstrap: Regression')
ax[0].set_title('RMSE')
ax[1].set_title('MAE')
plt.legend(title='Model', labels=list(bs.keys())[1:4:2],
bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()
Final Model: Neural Network¶
Neural network scores are generally better than the XGBoost models':
- higher F1 (accuracy is less important here)
- lower RMSE and MAE
However, a closer look at the heatmaps in model 2 shows that SMOTE gives better predictions for the last 3 classes, the ones indicating more severe fires. Combining the neural network and XGBoost may therefore yield a better classifier.
target = list(labels)
temp = extreme.drop(columns='burning_index_g')
X, X_test, yC, y_testC, X_train, X_val, y_trainC, y_valC = split(temp, target)
target = ['burning_index_g']
temp = extreme.drop(columns=labels)
X, X_test, yR, y_testR, X_train, X_val, y_trainR, y_valR = split(temp, target)
%%time
model = modelNN(X, yC)
history = model.fit(X, [yR, yC], validation_split=0.4,
epochs=50, batch_size=128, verbose=0)
y_predR, y_predC = model.predict(X_test, verbose=0)
y_predC = np.argmax(y_predC, axis=1)
y_testC = np.argmax(y_testC, axis=1)
plot_loss(history)
CPU times: user 17min 39s, sys: 3min 22s, total: 21min 2s
Wall time: 7min 35s
print('Regression: ', end=' ')
print(f'MAE={mean_absolute_error(y_testR, y_predR):.3f}', end=' ')
print(f'RMSE={mean_squared_error(y_testR, y_predR, squared=False):.3f}')
fig, axs = plt.subplots()
comp1 = heat(y_testC, y_predC, axs, 'Classification')
Regression: MAE=5.102 RMSE=7.939
Classification: F1=0.862 Acc=0.733
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.916142 | 0.905095 | 0.759467 | 0.702970 | 0.637260 |
| rec | 0.946931 | 0.811714 | 0.911091 | 0.184868 | 0.812715 |
| f1 | 0.931282 | 0.855865 | 0.828398 | 0.292748 | 0.714372 |
| sup | 189809.000000 | 204949.000000 | 114162.000000 | 13058.000000 | 5804.000000 |
target = list(labels)
temp = extremeS.drop(columns='burning_index_g')
X, X_test, yC, y_testC, X_train, X_val, y_trainC, y_valC = split(temp, target)
yC = np.argmax(yC, axis=1)
pipeline = Pipeline(steps=[('over', SMOTE()),
('under', RandomUnderSampler())])
X_, yC_ = pipeline.fit_resample(X, yC)
yC_ = pd.DataFrame(to_categorical(yC_), index=X_.index)
# smote
fig, axs = plt.subplots()
xgbC = XGBClassifier(eval_metric='aucpr', random_state=100)
xgbC.fit(X_, yC_)
# weights depend on f1 score results in model 2 and 4 (more weight if higher f1)
comp2 = heat(np.argmax(y_testC, axis=1),
np.argmax(xgbC.predict_proba(X_test) * [0.2, 0.2, 0.2, 0.8, 0.8] +
model.predict(X_test, verbose=0)[1] * [0.8, 0.8, 0.8, 0.2, 0.2], axis=1),
axs, 'Neural Network + SMOTE Classification')
plt.show()
Neural Network + SMOTE Classification: F1=0.870 Acc=0.757
| | class0 | class1 | class2 | class3 | class4 |
|---|---|---|---|---|---|
| pre | 0.920011 | 0.904823 | 0.772666 | 0.728000 | 0.746606 |
| rec | 0.948527 | 0.817775 | 0.906993 | 0.367306 | 0.746606 |
| f1 | 0.934051 | 0.859100 | 0.834458 | 0.488263 | 0.746606 |
| sup | 14357.000000 | 15415.000000 | 8795.000000 | 991.000000 | 442.000000 |
old = np.mean(comp1.loc['f1', 'class2':])
new = np.mean(comp2.loc['f1', 'class2':])
print(100*(new-old) / old)
12.738027803879405
Conclusion¶
Although the neural network performed better, a persistent error involving zero burn index appeared in every model tested. This error may stem from the unpredictability of short-lived and/or unnatural fires when using GRIDMET's daily aggregate data: daily data cannot adequately predict such fires because unnatural fires may start abruptly regardless of weather conditions. A finer temporal resolution, on the order of minutes or seconds, may better detect sudden changes. Furthermore, the acquired data includes no topographic information, which could be beneficial.
# save notebook
display(Javascript('IPython.notebook.save_checkpoint();'))
# save notebook as html to eugpoon.github.io/projects
!jupyter nbconvert wildfire.ipynb --to html
%mv "wildfire.html" "../eugpoon.github.io/projects/"
# restyle imports, clear output, replace file
!cleanipynb wildfire.ipynb
# restart kernel
display(HTML("<script>Jupyter.notebook.kernel.restart()</script>"))